# LLM Powered Content Extraction
This document explains the LLM-powered content extraction system that processes placement-related emails using Google Gemini (via LangChain and LangGraph). The system implements a four-stage pipeline:
Intelligent email classification using keyword scoring and confidence thresholds
Robust information extraction guided by strict schema requirements and JSON formatting
Validation and enhancement of extracted data
Privacy sanitization to remove sensitive metadata
It also documents the Pydantic models used for data representation, the LangGraph state machine, integration with LangChain/LangGraph, and practical guidance for retry mechanisms, error handling, and data sanitization.
The system is organized around services and clients that encapsulate responsibilities:
Services: PlacementService (extraction pipeline), EmailNoticeService (non-placement notices), PlacementNotificationFormatter (notification formatting)
Clients: GoogleGroupsClient (email fetching)
Core: Configuration and logging utilities
Orchestration: main.py coordinates fetching and processing
```mermaid
graph TD
    subgraph "Orchestration"
        MAIN["main.py"]
    end
    subgraph "Services"
        PS["PlacementService<br/>LangGraph pipeline"]
        ENS["EmailNoticeService<br/>LangGraph pipeline"]
        PN["PlacementNotificationFormatter<br/>Pydantic models"]
    end
    subgraph "Clients"
        GGC["GoogleGroupsClient<br/>IMAP email fetching"]
    end
    subgraph "Core"
        CFG["config.py<br/>Settings & logging"]
        DB["DatabaseService<br/>MongoDB ops"]
    end
    MAIN --> PS
    MAIN --> ENS
    PS --> GGC
    ENS --> GGC
    PS --> DB
    ENS --> DB
    PS --> PN
    CFG --> PS
    CFG --> ENS
```
PlacementService: Implements the four-stage LangGraph pipeline for placement offers, including classification, extraction, validation/enhancement, and privacy sanitization.
EmailNoticeService: Processes non-placement notices via a separate LangGraph pipeline with LLM-based classification and extraction.
PlacementNotificationFormatter: Formats extracted data into notices using Pydantic models.
GoogleGroupsClient: Fetches unread emails from Google Groups via IMAP and extracts forwarded metadata.
DatabaseService: Persists notices and placement offers to MongoDB and computes statistics.
Configuration: Centralized settings and logging utilities.
Key implementation highlights:
LangGraph StateGraph with typed state dictionaries
Strict JSON extraction prompts with privacy rules
Retry logic with bounded attempts for validation errors
Pydantic models for strong schema enforcement
Privacy sanitization removing headers and forwarded metadata
The system orchestrates email fetching, classification, extraction, validation, and persistence. The primary flow is driven by PlacementService’s LangGraph pipeline.
## PlacementService: Four-Stage Pipeline
State management: GraphState defines the pipeline state (email, classification flags, extracted data, validation errors, retry count).
Classification: Keyword scoring over sender, subject, and body; confidence aggregation; threshold-based decision.
Extraction: LLM prompt enforces strict schema and JSON output; privacy rules disallow headers/forwarding metadata.
Validation and Enhancement: Pydantic validation, consistency checks, defaults assignment for roles/packages.
Privacy Sanitization: Removes headers/forwarded markers from extracted fields.
Conditional Edges: Decide whether to extract based on relevance/confidence; retry extraction on validation errors.
```mermaid
graph TD
    Start(["Start"]) --> Classify["Classify Email<br/>Keyword scoring + confidence"]
    Classify --> IsRelevant{"Relevant?<br/>confidence >= 0.6"}
    IsRelevant --> |No| Display["Display Results"]
    IsRelevant --> |Yes| Extract["Extract Info<br/>LLM JSON + schema"]
    Extract --> Validate["Validate & Enhance<br/>Pydantic + defaults"]
    Validate --> HasErrors{"Has validation errors?"}
    HasErrors --> |Yes| Retry{"Retry attempts < 3?"}
    Retry --> |Yes| Extract
    Retry --> |No| Display
    HasErrors --> |No| Sanitize["Sanitize Privacy<br/>Remove headers/forwarded"]
    Sanitize --> Display
    Display --> End(["End"])
```
## Classification Algorithm: Keyword Scoring and Confidence
Placement keywords: presence increases score (bounded weight).
Company indicators: presence adds modest weight.
Negative keywords: reduce confidence (spam/security indicators).
Heuristics: presence of names, numbers, email formats.
Threshold: classification is relevant if confidence >= 0.6.
```mermaid
graph LR
    A["Email text<br/>sender + subject + body"] --> B["Compute keyword scores"]
    B --> C["Aggregate confidence<br/>placement + company + heuristics"]
    C --> D["Subtract spam/security penalties"]
    D --> E{"confidence >= 0.6?"}
    E --> |Yes| F["Mark relevant"]
    E --> |No| G["Mark not relevant"]
```
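A minimal sketch of this scoring scheme is shown below. The keyword lists and weights here are illustrative examples, not the service's actual configuration; only the overall shape (bounded positive weights, heuristic bonuses, spam penalties, 0.6 threshold) follows the description above.

```python
# Illustrative confidence scoring. Keyword sets and weights are examples.
PLACEMENT_KEYWORDS = {"placement", "offer", "selected", "ctc", "joining"}
COMPANY_INDICATORS = {"pvt", "ltd", "technologies", "labs"}
NEGATIVE_KEYWORDS = {"unsubscribe", "password", "phishing", "lottery"}

def classify_email(sender: str, subject: str, body: str,
                   threshold: float = 0.6) -> tuple[bool, float]:
    text = f"{sender} {subject} {body}".lower()
    score = 0.0
    # Placement keywords carry the most weight, bounded at 0.6 total.
    score += min(0.2 * sum(kw in text for kw in PLACEMENT_KEYWORDS), 0.6)
    # Company indicators add modest weight, bounded at 0.2.
    score += min(0.1 * sum(kw in text for kw in COMPANY_INDICATORS), 0.2)
    # Simple heuristics: numbers (packages/counts) and email addresses.
    if any(ch.isdigit() for ch in text):
        score += 0.1
    if "@" in body:
        score += 0.1
    # Spam/security indicators subtract confidence.
    score -= 0.3 * sum(kw in text for kw in NEGATIVE_KEYWORDS)
    score = max(0.0, min(1.0, score))
    return score >= threshold, score
```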
## Extraction Prompt Engineering and JSON Formatting
Two-phase prompt:
Phase 1: Final placement offer classification with strict criteria.
Phase 2: Structured extraction with strict schema and privacy rules.
Output format: Raw JSON only, no markdown or explanations.
Privacy rules: Do not include headers, sender info, or forwarded markers in extracted fields.
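The "raw JSON only" rule has a parsing counterpart worth sketching. The exact prompt text used by PlacementService is not reproduced here; the hypothetical helper below shows the consuming side: tolerate accidental markdown fences (models sometimes add them despite instructions), then require strictly valid JSON so failures surface early.

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse a raw-JSON-only LLM response, stripping stray ``` fences."""
    text = raw.strip()
    # Some models wrap output in ```json fences despite instructions.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    return json.loads(text)  # raises json.JSONDecodeError on malformed output
```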
## Validation and Enhancement
Pydantic validation ensures schema compliance.
Consistency checks: company name length, presence of students/roles, number_of_offers alignment.
Enhancement: default role/package assignment when single role exists; normalize counts.
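The consistency and enhancement rules can be sketched at the dict level. In the service these run on a Pydantic-validated PlacementOffer; the hypothetical `enhance_offer` below shows the same rules on the parsed JSON dict for brevity, with the schema check omitted.

```python
def enhance_offer(offer: dict) -> list[str]:
    """Return consistency errors; apply enhancement defaults in place."""
    errors: list[str] = []
    if len(str(offer.get("company", "")).strip()) < 2:
        errors.append("company name missing or too short")
    roles = offer.get("roles") or []
    students = offer.get("students_selected") or []
    if not roles and not students:
        errors.append("neither roles nor students present")
    # Enhancement: a single role becomes the default for every student.
    if len(roles) == 1:
        for student in students:
            student.setdefault("role", roles[0].get("role"))
            student.setdefault("package", roles[0].get("package"))
    # Normalize number_of_offers to the student list when unset.
    if not offer.get("number_of_offers") and students:
        offer["number_of_offers"] = len(students)
    return errors
```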
## Privacy Sanitization
Removes email headers and forwarded markers from extracted fields (additional_info, roles.package_details, job_location).
Ensures no sender/forwarding metadata appears in user-facing content.
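A regex-based sketch of this step is shown below. The patterns cover the common header and forwarding markers described above; the service's actual pattern list may differ, and the field names targeted are the ones listed in this section.

```python
import re

# Simplified header/forwarding patterns; the real list may be longer.
_HEADER_PATTERNS = [
    re.compile(r"^-+\s*Forwarded message\s*-+.*$", re.IGNORECASE | re.MULTILINE),
    re.compile(r"^(From|To|Cc|Date|Subject|Sent):\s.*$", re.IGNORECASE | re.MULTILINE),
]

def sanitize_text(text: str) -> str:
    for pattern in _HEADER_PATTERNS:
        text = pattern.sub("", text)
    # Collapse blank lines left behind by the removals.
    return re.sub(r"\n{2,}", "\n", text).strip()

def sanitize_offer_fields(offer: dict) -> dict:
    """Clean the user-facing fields named in this section."""
    for field in ("additional_info", "job_location"):
        if isinstance(offer.get(field), str):
            offer[field] = sanitize_text(offer[field])
    for role in offer.get("roles", []):
        if isinstance(role.get("package_details"), str):
            role["package_details"] = sanitize_text(role["package_details"])
    return offer
```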
## Pydantic Models and State Management
Student: name, enrollment_number, email, role, package
RolePackage: role, package, package_details
PlacementOffer: company, roles, job_location, joining_date, students_selected, number_of_offers, additional_info, email_subject, email_sender, time_sent
GraphState: email, is_relevant, confidence_score, classification_reason, rejection_reason, extracted_offer, validation_errors, retry_count
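A field-level sketch of these models follows. The field names come from the lists above; the types and optionality are reasonable guesses from those names, and the real models may add validators.

```python
from typing import Optional, TypedDict
from pydantic import BaseModel

class Student(BaseModel):
    name: str
    enrollment_number: Optional[str] = None
    email: Optional[str] = None
    role: Optional[str] = None
    package: Optional[str] = None

class RolePackage(BaseModel):
    role: str
    package: Optional[str] = None
    package_details: Optional[str] = None

class PlacementOffer(BaseModel):
    company: str
    roles: list[RolePackage] = []
    job_location: Optional[str] = None
    joining_date: Optional[str] = None
    students_selected: list[Student] = []
    number_of_offers: int = 0
    additional_info: Optional[str] = None
    email_subject: Optional[str] = None
    email_sender: Optional[str] = None
    time_sent: Optional[str] = None

# The LangGraph pipeline state is a plain typed dict, not a Pydantic model.
class GraphState(TypedDict, total=False):
    email: dict
    is_relevant: bool
    confidence_score: float
    classification_reason: str
    rejection_reason: str
    extracted_offer: Optional[PlacementOffer]
    validation_errors: list[str]
    retry_count: int
```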
## Integration with LangChain and LangGraph
ChatGoogleGenerativeAI configured with model and temperature.
StateGraph with nodes for classify, extract_info, validate_and_enhance, sanitize_privacy, display_results.
Conditional edges for decision-making and retries.
## Email Fetching and Orchestration
GoogleGroupsClient fetches unread emails, parses bodies, and extracts forwarded metadata/time.
main.py orchestrates email processing: fetch unread IDs, iterate over the emails, try PlacementService first, fall back to EmailNoticeService for non-placement notices, persist results, and mark each email as read.
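That orchestration loop can be sketched as follows. The method names here (`fetch_unread_ids`, `fetch_email`, `process`, `mark_read`, `save_offer`, `save_notice`) are simplified stand-ins for the real interfaces in the codebase.

```python
def process_inbox(client, placement_service, notice_service, db) -> int:
    """Fetch unread emails, route each through the services, persist, mark read."""
    processed = 0
    for email_id in client.fetch_unread_ids():
        email = client.fetch_email(email_id)
        # Try the placement pipeline first; fall back to general notices.
        offer = placement_service.process(email)
        if offer is not None:
            db.save_offer(offer)
        else:
            notice = notice_service.process(email)
            if notice is not None:
                db.save_notice(notice)
        client.mark_read(email_id)
        processed += 1
    return processed
```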
External dependencies include LangChain, LangGraph, and Pydantic for LLM integration, state management, and schema enforcement. Internal dependencies show clear separation of concerns:
PlacementService depends on GoogleGroupsClient, ChatGoogleGenerativeAI, and DatabaseService.
EmailNoticeService depends on GoogleGroupsClient, ChatGoogleGenerativeAI, and PlacementPolicyService.
PlacementNotificationFormatter depends on Pydantic models and DatabaseService.
Sequential processing of emails: safer than concurrent processing, and allows granular error handling and retry logic per email.
Retry limits: bounded attempts (e.g., 3) prevent infinite loops and reduce LLM cost.
JSON parsing and validation: early failure detection reduces downstream processing overhead.
Privacy sanitization: minimal overhead via regex-based cleaning; applied only when needed.
Logging and daemon mode: configurable logging minimizes I/O impact in production.
Common issues and resolutions:
Empty or malformed LLM response: treated as non-placement offer; pipeline proceeds to display results.
JSON parsing failures: retry up to configured limit; on exhaustion, record validation errors and rejection reason.
Validation errors (schema mismatch): retry with bounded attempts; otherwise mark as invalid.
Privacy leakage: ensure privacy sanitization runs after extraction; verify headers/forwarded markers are removed.
Email fetching failures: check credentials and network connectivity; re-run with verbose logging.
Operational tips:
Use verbose logging to inspect confidence scores and classification reasons.
Monitor retry counts and validation errors to identify prompt/schema drift.
Verify forwarded metadata extraction and sanitization for accurate timestamps and sender attribution.
The LLM-powered content extraction system leverages Google Gemini through LangChain and LangGraph to deliver a robust, schema-driven pipeline for placement offers. Its four-stage design—classification, extraction, validation/enhancement, and privacy sanitization—ensures high-quality, privacy-compliant outputs. Strong Pydantic models, retry logic, and careful privacy handling make the system resilient and maintainable.
## Example Input/Output Transformations
Input: Email subject/body with forwarded headers and metadata.
Transformation: Headers and forwarded markers removed; LLM extracts JSON aligned to PlacementOffer schema.
Output: PlacementOffer object persisted to database with derived metadata and sanitized fields.
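A concrete end-to-end illustration follows. The email snippet and the target JSON are fabricated examples, and the regexes are simplified versions of the sanitization shown earlier; they only demonstrate the shape of the transformation.

```python
import re

# Fabricated forwarded email body with header residue.
raw_body = (
    "---------- Forwarded message ----------\n"
    "From: tpo@college.edu\n"
    "Subject: Final Offer - Acme Corp\n"
    "Acme Corp has selected 2 students for the SDE role at 12 LPA, "
    "joining in Bengaluru."
)

# Sanitize: drop forwarded markers and header lines, collapse blanks.
clean = re.sub(r"^(-+\s*Forwarded message\s*-+|(From|To|Subject|Date):\s.*)$",
               "", raw_body, flags=re.IGNORECASE | re.MULTILINE)
clean = re.sub(r"\n+", "\n", clean).strip()

# The LLM is then asked to emit raw JSON for the cleaned text,
# shaped like the PlacementOffer schema (illustrative values):
expected = {
    "company": "Acme Corp",
    "roles": [{"role": "SDE", "package": "12 LPA"}],
    "job_location": "Bengaluru",
    "number_of_offers": 2,
}
```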